Skip to content

refactor/metadata package#3919

Open
d-v-b wants to merge 6 commits intozarr-developers:mainfrom
d-v-b:refactor/metadata-package
Open

refactor/metadata package#3919
d-v-b wants to merge 6 commits intozarr-developers:mainfrom
d-v-b:refactor/metadata-package

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented Apr 21, 2026

This PR creates a new subpackage called zarr-metadata just for JSON metadata. it's stored in packages/zarr_metadata. It contains typeddict classes that model the spec-defined JSON forms for v2 array + group metadata, and v3 array + group metadata, including data types, codecs, chunk key encodings and chunk grids. I only included type definitions for metadata that has an external spec. So zarr-python will need to define some types internally for e.g. unspecified data types or codecs.

I would like to publish this subpackage to pypi. These types useful to any python tool that works with zarr data, even if that tool doesn't use zarr-python. It is also useful to zarr-python, because it means we can remove and resolve some lingering questions about publishing types.

If we adopted the changes here, adding a new data type / codec / chunk grid, etc, would require adding types to zarr-metadata, then adding the implementation in zarr-python that work with those types. We wouldn't need to do these 2 operations in the same PR, but I expect that would be the normal practice.

This change does add complexity to our publishing workflow: we need to ensure that zarr-metadata changes are published at or before new zarr-python releases. We should add some checks to ensure that this happens.

Docs for the new package are missing from this PR. I would handle that in a follow-up.

I would appreciate feedback at all levels, including the following topics:

  • are we aligned on the idea of a subpackage?
  • are the types correct (do they match the spec documents)
  • what needs to change in our CI?

closes #3355 and #3795

@github-actions github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 21, 2026
@d-v-b d-v-b force-pushed the refactor/metadata-package branch from ef4b773 to fbc24f3 Compare April 22, 2026 11:50
d-v-b and others added 2 commits April 22, 2026 13:59
Adds packages/zarr-metadata, a sibling PyPI package that contains
spec-defined Zarr v2 and v3 metadata types as pure-typing artifacts
(TypedDicts, type aliases, Final string constants, NewType validators).
No runtime logic beyond minimal regex validators for spec-format-locked
strings (hex floats, base64 bytes, raw-bytes name).

Layout (anything imported from `zarr_metadata.v3.X` is a v3-spec
artifact; from `zarr_metadata.v2.X` is v2-spec; only true cross-version
primitives sit at the top level):

  zarr_metadata/
  ├── common.py                # JSON, NamedConfig, NamedRequiredConfig
  ├── __init__.py              # ArrayMetadata, GroupMetadata unions
  ├── v2/
  │   ├── array.py, group.py, consolidated.py, codec.py
  └── v3/
      ├── array.py, group.py, consolidated.py
      ├── chunk_grid/      {regular, rectilinear}
      ├── chunk_key_encoding/ {default, v2}
      ├── codec/           {blosc, bytes, crc32c, gzip,
      │                     sharding, transpose, zstd}
      └── data_type/       {bool, int8/16/32/64, uint8/16/32/64,
                            float16/32/64, complex64/128, bytes,
                            string, numpy_datetime64, numpy_timedelta64,
                            struct, raw}

zarr-python source is unchanged in this branch. zarr-metadata is shipped
as an independent package; a follow-up PR will adopt it inside zarr-python
once the package is published to PyPI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b force-pushed the refactor/metadata-package branch from e6303f5 to 0ae8db9 Compare April 22, 2026 12:00
d-v-b and others added 2 commits April 22, 2026 15:22
The structural tests constructed dict literals annotated with TypedDict
types and asserted that a key we just inserted came back out -- which
exercises Python's dict, not the type. The TypedDict has no runtime
shape check, so even a type-incompatible dict would pass. Pyright (in
CI) is what actually verifies the shapes.

Constant-equality tests (e.g. `assert NAME == "name"`) also tested
nothing.

Kept the validating-constructor tests for hex_float{16,32,64},
base64_bytes, and raw_bytes_dtype_name -- those exercise real regex
logic and catch real bugs.

Replaced test_imports.py with a single smoke test that confirms the
package loads and the top-level union types are reachable. Renamed
test_structural.py -> test_validators.py to match what it actually
tests.

Test count: 47 -> 6, ~750 lines -> 89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.11%. Comparing base (029c376) to head (331ea93).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3919   +/-   ##
=======================================
  Coverage   93.11%   93.11%           
=======================================
  Files          85       85           
  Lines       11365    11365           
=======================================
  Hits        10582    10582           
  Misses        783      783           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 22, 2026

if this gets merged, we would need the following changes in subsequent PRs:

  • setting up CI / CD for zarr-metadata
  • publishing zarr-metadata to pypi
  • docs
  • changes to zarr-python:
    • depending on zarr-metadata
    • removing internally-defined types
    • deprecating re-exported public types
    • integrating zarr-python's docs with zarr-metadata's docs

@d-v-b d-v-b requested a review from a team April 22, 2026 13:41
zarr-metadata is a typing-only foundational package consumed by
libraries; supporting one Python version below the current minimum
widens the audience at minimal cost.

The only blocker was PEP 695 generic class syntax in `common.py`:

    class NamedConfig[TName: str, TConfig: ...](TypedDict):

Rewritten to the PEP 484 `Generic[T]` form, which works on 3.11+.
The two affected classes carry `# noqa: UP046` comments since the
ruff rule pushes toward the newer syntax that we deliberately avoid.

All other modern features (PEP 604 `X | Y` unions at runtime,
`NotRequired`, `extra_items=` via typing_extensions, `ReadOnly`)
already work on 3.11.

Verified: package imports and all 6 tests pass on Python 3.11.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the enhancement New features or improvements label Apr 23, 2026
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 27, 2026

@zarr-developers/python-core-devs I'd love feedback on this.

@maxrjones
Copy link
Copy Markdown
Member

@zarr-developers/python-core-devs I'd love feedback on this.

I really like the idea of having a public interface for metadata classes with minimal dependencies, and am fine with a sub-package as the mechanism for doing that.

The order of operations seems a little off to me. In particular, I'm concerned about drift over the course of the follow-up PRs, and us realizing that we're not happy with the design of zarr-metadata while migrating the core library. The following sequence would make more sense to me:

  • Open a PR with a design document for public metadata classes
  • Refactor the internal metadata classes to agree with the agreed upon API
  • Move internal metadata classes to a subpackage
  • Release subpackage
  • Depend on subpackage internally

As an aside, I'd recommend reserving the package names that you're hoping to use on PyPI in anticipation of these changes

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 27, 2026

Open a PR with a design document for public metadata classes
Refactor the internal metadata classes to agree with the agreed upon API

The purpose of this PR is not to define metadata classes that zarr-python uses. It is narrowly scoped to define typeddicts that model the contents of zarr metadata documents. So this package doesn't affect anything about the shape of metadata-modelling classes in zarr-python, except by defining the type of their JSON form.

This package would only replace typeddict / literal types that we currently define inside zarr-python. Allowing this package to own those types requires no behavioral changes from zarr-python.

That's why my timeline is roughly "make this package stand alone (exist on pypi), then make zarr-python depend on it"

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 27, 2026

n particular, I'm concerned about drift over the course of the follow-up PRs, and us realizing that we're not happy with the design of zarr-metadata while migrating the core library.

What aspects of the code added here do you think might drift? The only changes I anticipate would be fixes for the following issues:

  • failure to match the spec
  • failure to use python types ergonomically
  • name changes / bikeshedding

All of these can be handled gracefully by downstream packages.

As an aside, I'd recommend reserving the package names that you're hoping to use on PyPI in anticipation of these changes

good idea, this is done under my own pypi account. if this PR is approved I would transfer ownership to zarr-developers.

@maxrjones
Copy link
Copy Markdown
Member

The purpose of this PR is not to define metadata classes that zarr-python uses.

It seemed from your earlier request for feedback that this subpackage was intended to eventually be useful for zarr-python.

What aspects of the code added here do you think might drift?

Most visibly, I'd like some more thought to be put into the names of the classes. It's inconsistent that some include Metadata and others don't.

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 27, 2026

It seemed from your earlier request for feedback that this subpackage was intended to eventually be useful for zarr-python.

The types and typeddicts in this package are useful for zarr python. These types define the JSON structure of the array metadata, including all the chunk grids, codecs, data types, etc. We use this information to model the JSON metadata documents we read and write.

So the impact of this package would basically be upstreaming many typeddict and literal types that zarr-python defines, e.g. this one.

Most visibly, I'd like some more thought to be put into the names of the classes. It's inconsistent that some include Metadata and others don't.

The array metadata, group metadata, and consolidated metadata typddicts are named XMetadata. That's tracking how we talk about these data structures. I'd be fine with using the metadata suffix for other types, e.g. BytesCodec -> BytesCodecMetadata.

Copy link
Copy Markdown
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction here. I think we should have a very high bar for accuracy if we're promoting these as the canonical representation for public use.

For naming (recognizing this is bike-shedding), I'd prefer consistent suffix conventions:

  • Codecs all end in Codec
  • DataTypes all end in DataType
  • Chunk grids all end in ChunkGrid

I think this functionality would be best tested with some round-trip property tests

Comment thread packages/zarr-metadata/pyproject.toml
Comment on lines +19 to +24
See https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#data-type-encoding
"""

fieldname: str
datatype: str
shape: NotRequired[tuple[int, ...]]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the spec-faithful representation just have a list of lists? the V2 spec doesn't have JSON keys for fieldname or datatype

Comment on lines +37 to +44
data_type: MetadataField
shape: tuple[int, ...]
chunk_grid: MetadataField
chunk_key_encoding: MetadataField
fill_value: JSON
codecs: tuple[MetadataField, ...]
attributes: NotRequired[Mapping[str, JSON]]
storage_transformers: NotRequired[tuple[MetadataField, ...]]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are more restrictions on data_type, chunk_grid, chunk_key_encoding, and codecs than implied by MetadataField. This would be more useful if it were more strictly typed. In addition, MetadataField allowing a string shorthand is wrong in the chunk grid case.

"""

name: Crc32cCodecName
configuration: NotRequired[dict[str, JSON]]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this typing allows erroneous metadata (e.g., "configuration": {"foo": 1}})

clevel: int
shuffle: BloscShuffle
blocksize: int
typesize: int
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typesize should be optional because it is not required with noshuffle


Codec = str | Mapping[str, object]
"""
The widest JSON shape that can specify a codec (v2 or v3).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have a v2 or v3 codec under the v3 namespace?

from zarr_metadata.common import JSON, NamedConfig


class AllowedExtraField(TypedDict, extra_items=JSON): # type: ignore[call-arg]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd slightly prefer OptionalExtension as the name here

Comment on lines +20 to +22
"INT32_DTYPE_NAME",
"Int32DTypeName",
"Int32FillValue",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need three exports for each data type/codec? Could we start more minimal and add more if they are needed?

Comment on lines +17 to +20
DateTimeUnit = Literal[
"Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as", "generic"
]
"""Time unit codes used by numpy.datetime64."""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be declared once and shared between this and timedelta

Comment on lines +36 to +47
class NamedRequiredConfig(TypedDict, Generic[TName, TConfig]): # noqa: UP046
"""
Named-config envelope with required configuration.

Generic with two parameters: name literal and configuration mapping.

Uses the PEP 484 ``Generic[T]`` form rather than PEP 695 ``[T]`` syntax
so the package supports Python 3.11.
"""

name: ReadOnly[TName]
configuration: ReadOnly[TConfig]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why include this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New features or improvements needs release notes Automatically applied to PRs which haven't added release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants